Purpose

Find MDCs associated with Medicaid and/or Private Insurance payer types.

Analyzing MDC codes from all admissions in HCUP NY SID 2006-2012.

MDC Plots

All plots show the number of admissions by Major Diagnostic Category (MDC).

Scatterplot

By Counts of Medicaid Admission

MDCs ordered in descending order of Medicaid admission counts

By Proportion of Medicaid Admissions

MDCs ordered in descending order of Medicaid admission proportion (of all admissions)

By Ratio of Medicaid to Private Insurance Admissions

MDCs ordered in descending order of ratio of Medicaid:Private Insurance admissions
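The three orderings above differ only in the score used to rank each MDC. A minimal sketch of that ranking, assuming a hypothetical per-MDC table of counts (the `admissions` dictionary, its keys, and all numbers are made up for illustration and are not taken from the HCUP data):

```python
# Hypothetical admission counts per MDC; values are illustrative only.
admissions = {
    "MDC 05": {"medicaid": 120, "private": 300, "total": 900},
    "MDC 14": {"medicaid": 400, "private": 350, "total": 800},
    "MDC 19": {"medicaid": 150, "private": 50, "total": 400},
}


def order_mdcs(admissions, metric):
    """Return MDC labels sorted in descending order of the chosen metric."""

    def score(counts):
        if metric == "count":
            # Raw number of Medicaid admissions.
            return counts["medicaid"]
        if metric == "proportion":
            # Medicaid admissions as a share of all admissions.
            return counts["medicaid"] / counts["total"]
        if metric == "ratio":
            # Medicaid:Private Insurance ratio (assumes private > 0).
            return counts["medicaid"] / counts["private"]
        raise ValueError(f"unknown metric: {metric}")

    return sorted(admissions, key=lambda mdc: score(admissions[mdc]), reverse=True)


by_count = order_mdcs(admissions, "count")
by_proportion = order_mdcs(admissions, "proportion")
by_ratio = order_mdcs(admissions, "ratio")
```

Note that the ratio ordering can differ from the other two: an MDC with few admissions overall can still rank first if Medicaid dominates Private Insurance within it.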

K-means

Background & Algorithm

K-means clustering classifies MDCs into k groups such that MDCs within the same cluster are as similar as possible, and MDCs from different clusters are as dissimilar as possible. For our data, similarity is represented by the number of discharges/admissions from each payer type.

K-means defines clusters by trying to minimize the total within-cluster variation. The standard algorithm (Hartigan and Wong, 1979) defines the within-cluster variation as the sum of squared Euclidean distances between each MDC and its corresponding cluster centroid:

\[W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2\]

where:

  • \(x_i\) is an MDC belonging to cluster \(C_k\)
  • \(\mu_k\) is the mean value of the MDCs assigned to cluster \(C_k\). This is a vector of the means of all discharges by payer type for all MDCs in the cluster.

The algorithm tries to minimize the total within-cluster variation:

\[Total.Within.SS = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2\]
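As a rough illustration of the two formulas above, the sketch below computes the within-cluster variation and its total for a toy clustering. The function names and the two-dimensional points are assumptions made for this example; in the analysis each point would be an MDC's vector of discharge counts by payer type.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points (mu_k)."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def within_ss(points):
    """W(C_k): sum of squared Euclidean distances to the cluster centroid."""
    mu = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in points)


def total_within_ss(clusters):
    """Total.Within.SS: sum of W(C_k) over all clusters."""
    return sum(within_ss(c) for c in clusters)


# Toy clustering with two clusters of two points each (made-up values).
clusters = [
    [(1.0, 2.0), (3.0, 4.0)],      # centroid (2, 3),   W = 4
    [(10.0, 10.0), (10.0, 14.0)],  # centroid (10, 12), W = 8
]
total = total_within_ss(clusters)  # 4 + 8 = 12
```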

The k-means algorithm can be summarized as:

  1. Specify the number of clusters (k).
  2. Randomly select k MDCs from the data as the initial cluster centroids (means).
  3. Assign each MDC to its closest centroid, based on Euclidean distance.
  4. For each of the k clusters, update the cluster centroid by recalculating the mean values of all MDCs in the cluster.
  5. Iteratively minimize the total within-cluster sum of squares, i.e., repeat steps 3 and 4 until cluster assignments stop changing or a user-specified maximum number of iterations is reached.
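The five steps above can be sketched in plain Python as follows. This is a minimal illustration of the steps exactly as listed (a Lloyd-style iteration); it is not the Hartigan-Wong procedure, which updates clusters differently, and it is not the code used for this analysis. The function names, the `seed` parameter, and the toy points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after running steps 1-5 on `points`."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random points as centroids
    labels = None
    for _ in range(max_iter):  # step 5: stop at max_iter at the latest
        # Step 3: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # step 5: assignments stopped changing
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data: two well-separated groups of three points (made-up values).
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on less clearly separated data they can also lead to different final clusterings.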

Cluster Results

Implemented k-means clustering for \(k = 2, \dots, 15\). Below are plots of the clusters for \(k = 2, \dots, 6\). Such plots are usually projected onto the first two principal components, but here we are interested in two specific dimensions: Medicaid and Private Insurance admissions.

Determining Optimal Clusters

Recall that k-means defines clusters by minimizing the total within-cluster variation (Total.Within.SS). We can plot the Total.Within.SS against the number of clusters k to choose the optimal number of clusters.

As k increases, the Total.Within.SS approaches 0, reaching 0 when every MDC forms its own cluster. Researchers generally use the “elbow method”: choose the value of k at which the curve bends, beyond which adding clusters yields diminishing returns in reducing the Total.Within.SS.
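The elbow is normally read off the plot by eye. One simple way to approximate it numerically, offered here only as an assumed heuristic and not as the method used in this analysis, is to pick the k with the largest second difference in Total.Within.SS, i.e. where the curve flattens most sharply. The function name and the values below are made up for illustration.

```python
def elbow_by_second_difference(wss_by_k):
    """Return the k where the Total.Within.SS curve bends most sharply.

    `wss_by_k` maps consecutive values of k to Total.Within.SS.
    The first and last k cannot be returned because they have no
    neighbor on one side.
    """
    ks = sorted(wss_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of k")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Second difference: drop before k minus drop after k.
        bend = wss_by_k[prev_k] - 2 * wss_by_k[k] + wss_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical Total.Within.SS values, not results from the MDC data.
example_wss = {2: 900.0, 3: 500.0, 4: 200.0, 5: 170.0, 6: 150.0}
elbow_k = elbow_by_second_difference(example_wss)
```

In this made-up example the drop from k=3 to k=4 is large and the drop from k=4 to k=5 is small, so the heuristic selects k=4. A heuristic like this is sensitive to noise, so it is best used alongside the plot.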

The scree plot above suggests that k=5 is the optimal number of clusters. However, recall that we are clustering MDCs on the number of discharges per payer type, whereas we are only interested in finding subsets of MDCs that are more strongly associated with either Medicaid or Private Insurance.